The Evolution of Large Language Model Architectures: From BERT to GPT and T5
AI012 Lesson 2

The Three Paradigms of the Transformer Architecture

The evolution of large language models marks a paradigm shift: away from task-specific models and toward "unified pre-training," in which a single architecture can be adapted to a wide range of natural language processing needs.

At the heart of this shift is the self-attention mechanism, which lets a model weigh the importance of different words in a sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
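The formula above can be sketched in a few lines of NumPy — a minimal illustration for intuition, not an optimized implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # weighted sum of value vectors

# Toy example: a sequence of 3 tokens with d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # one output vector per query token
```

The `1/sqrt(d_k)` scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.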

1. Encoder-Only (BERT)

  • Mechanism: Masked Language Modeling (MLM)
  • Behavior: bidirectional context; the model "sees" the entire sentence at once in order to predict the masked words.
  • Best for: natural language understanding (NLU), sentiment analysis, and named entity recognition (NER).

2. Decoder-Only (GPT)

  • Mechanism: autoregressive modeling
  • Behavior: left-to-right processing; predicts the next token strictly from the preceding context (causal masking).
  • Best for: natural language generation (NLG) and creative writing. This is the foundation of modern LLMs such as GPT-4 and Llama 3.

3. Encoder-Decoder (T5)

  • Mechanism: Text-to-Text Transfer Transformer.
  • Behavior: the encoder converts the input string into a dense representation, and the decoder generates the target string.
  • Best for: translation, summarization, and similar sequence-to-sequence tasks.
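The behavioral difference between the encoder's bidirectional context and the decoder's left-to-right processing comes down to the attention mask. A minimal sketch:

```python
import numpy as np

seq_len = 4

# Encoder-style (BERT): every token may attend to every position
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (GPT): token i may attend only to positions j <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# The mask is applied before the softmax: disallowed positions get -inf,
# so they receive zero attention weight
scores = np.zeros((seq_len, seq_len))   # placeholder attention scores
masked = np.where(causal, scores, -np.inf)
```

Because `exp(-inf) = 0`, the softmax assigns no weight to future tokens, which is exactly what makes next-token prediction well defined during training.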

Key Insight: Decoder Dominance

The industry has largely converged on decoder-only architectures, owing to their superior scaling laws and their emergent reasoning abilities in zero-shot settings.

VRAM and the Context Window

In a decoder-only model, the KV cache grows linearly with sequence length. A 100K-token context window requires far more VRAM than an 8K window, which makes local deployment of long-context models very challenging without quantization.
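A back-of-the-envelope estimate makes the linear growth concrete. The configuration below (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical Llama-style setup chosen for illustration, not the specification of any particular model:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: keys + values stored per layer, per head, per token.
    Defaults are a hypothetical fp16, Llama-style configuration."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

GIB = 1024 ** 3
short = kv_cache_bytes(8_192) / GIB     # 8K window  -> 1.0 GiB
long = kv_cache_bytes(100_000) / GIB    # 100K window -> ~12.2 GiB
```

Under these assumptions the cache costs a fixed 128 KiB per token, so a 100K window needs roughly 12x the VRAM of an 8K window — before counting the model weights themselves.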
Question 1

Why did the industry move from BERT-style encoders to GPT-style decoders for Large Language Models?

  • Decoders scale more effectively for generative tasks and follow instructions via next-token prediction.
  • Encoders cannot process text bidirectionally.
  • Decoders require less training data for classification tasks.
  • Encoders are incompatible with the self-attention mechanism.
Question 2

Which architecture treats every NLP task as a "text-to-text" problem?

  • Encoder-Only (BERT)
  • Decoder-Only (GPT)
  • Encoder-Decoder (T5)
  • Recurrent Neural Networks (RNN)
Challenge: Architectural Bottlenecks

Analyze deployment constraints based on architecture.

If you are building a model for real-time document summarization where the input is very long, explain why a decoder-only model might be preferred over an encoder-decoder model in modern deployments.
Step 1
Identify the architectural bottleneck regarding context processing.
Solution:
Encoder-decoder models must push the entire long input through the encoder and then perform cross-attention in the decoder, which is computationally heavy and complex to optimize for extremely long sequences. Decoder-only models process everything uniformly as a single token stream. With modern techniques such as FlashAttention and KV cache optimization, scaling the context window of a decoder-only model is more streamlined and efficient for real-time generation.
Step 2
Justify the preference using Scaling Laws.
Solution:
Decoder-only models have demonstrated highly predictable performance improvements (scaling laws) as parameters and training data increase. This massive scale unlocks "emergent abilities," allowing a single decoder-only model to perform zero-shot summarization effectively, without the task-specific fine-tuning often required by smaller encoder-decoder setups.